You can download the rmd file here.
ggplot2 follows a theory of data visualization called the grammar of graphics. You can summarize this grammar as:
Each graph has the following components:
data: the dataset containing the variables you want to visualize
geom: the type of geometric object you want to graph (e.g., bars, points, boxplots)
aes: the aesthetic attributes you want to apply to the geometric object (including which variables should be on the x and y axes, and the color, shape, and size of the geometric object)
Here is a general ggplot template:
You don’t need to remember the syntax! Here’s the
A ggplot object can have multiple components connected with +, each of which specifies a layer on the graph.
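As a minimal sketch of how layers accumulate, the example below builds a plot one `+` at a time. The `toy` data frame and its values are made up for illustration and are not part of this lab's dataset.

```r
library(ggplot2)

# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# Start with the data and aesthetic mappings; no geom yet, so nothing is drawn
base <- ggplot(data = toy, mapping = aes(x = x, y = y))

# Each `+` adds one layer on top of the previous ones
with_points <- base + geom_point()
with_line <- with_points + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# layer_data() returns the data that a given layer will draw
points_drawn <- layer_data(with_line, i = 1)
line_drawn <- layer_data(with_line, i = 2)
nrow(points_drawn)
```

Because each intermediate object is itself a plot, you can print `base`, `with_points`, or `with_line` to see what each layer contributes.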
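# load the packages used below (dplyr provides %>%, mutate(), and recode())
library(ggplot2)
library(dplyr)
# load a dataset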
data(CPS85, package = "mosaicData")
# check the structure
str(CPS85)
## 'data.frame': 534 obs. of 11 variables:
## $ wage : num 9 5.5 3.8 10.5 15 9 9.57 15 11 5 ...
## $ educ : int 10 12 12 12 12 16 12 14 8 12 ...
## $ race : Factor w/ 2 levels "NW","W": 2 2 2 2 2 2 2 2 2 2 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 1 1 2 1 1 2 2 1 ...
## $ hispanic: Factor w/ 2 levels "Hisp","NH": 2 2 2 2 2 2 2 2 2 2 ...
## $ south : Factor w/ 2 levels "NS","S": 1 1 1 1 1 1 1 1 1 1 ...
## $ married : Factor w/ 2 levels "Married","Single": 1 1 2 1 1 1 1 2 1 1 ...
## $ exper : int 27 20 4 29 40 27 5 22 42 14 ...
## $ union : Factor w/ 2 levels "Not","Union": 1 1 1 1 2 1 2 1 1 1 ...
## $ age : int 43 38 22 47 58 49 23 42 56 32 ...
## $ sector : Factor w/ 8 levels "clerical","const",..: 2 7 7 1 2 1 8 7 4 7 ...
ggplot(data = <data>, mapping = aes(x = <x-axis variable>, y = <y-axis variable>))
# generate a univariate graph with a categorical variable
ggplot(data = CPS85, mapping = aes(x = sex))
We need to make sure the categorical variable is a factor. We can then rename the categories with recode() and change their order with the levels parameter of factor().
# check the class of the variable
class(CPS85$sex)
## [1] "factor"
# rename the labels
CPS85_clean <- CPS85 %>%
mutate(sex = recode(sex, F = "Female", M = "Male"))
ggplot(data = CPS85_clean, mapping = aes(x = sex))
# change the order
CPS85_clean %>%
mutate(sex = factor(sex, levels = c("Male", "Female"))) %>%
ggplot(mapping = aes(x = sex))
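The effect of `levels` can be checked without drawing anything. The sketch below uses a small made-up vector (not the CPS85 data) and only base R; it also shows the `labels` parameter of `factor()`, which is an alternative to `recode()` for renaming while reordering.

```r
# Toy factor (hypothetical values, for illustration only)
sex <- factor(c("M", "F", "F", "M"))

# By default, levels are sorted alphabetically
default_levels <- levels(sex)

# Passing levels = ... changes the order without changing the values
sex_reordered <- factor(sex, levels = c("M", "F"))
reordered_levels <- levels(sex_reordered)

# levels and labels together reorder and rename in one step
sex_labeled <- factor(sex, levels = c("M", "F"), labels = c("Male", "Female"))
levels(sex_labeled)
```

ggplot2 draws discrete categories in level order, so whichever level comes first ends up leftmost on the x-axis.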
ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
geom_bar()
Add a fill color by specifying the fill parameter, and an outline color by specifying the color parameter.
ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
geom_bar(fill = 'darkorange', color = 'black')
Fill the bars with colors based on the levels of a categorical variable by assigning the categorical variable to fill. Note: when assigning a variable to fill, it has to be inside aes(), either in ggplot() or in the geom. Outside aes(), fill only accepts fixed values such as a color name.
ggplot(data = CPS85_clean, mapping = aes(x = sex, fill = sex)) +
geom_bar(color = 'black')
# this doesn't work
# ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
# geom_bar(fill = sex, color = 'black')
# this works
ggplot(data = CPS85_clean) +
geom_bar(aes(x = sex, fill = sex), color = 'black')
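One way to see the difference between setting a fixed color and mapping a variable is to inspect what each layer will draw. This is a sketch on a made-up data frame; `toy` and its values are assumptions for illustration.

```r
library(ggplot2)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(sex = factor(c("Female", "Male", "Male")))

# Setting: a fixed value passed outside aes() applies to every bar
p_set <- ggplot(toy, aes(x = sex)) +
  geom_bar(fill = "darkorange")

# Mapping: a variable passed inside aes() gives each level its own color
p_map <- ggplot(toy, aes(x = sex)) +
  geom_bar(aes(fill = sex))

# layer_data() shows the fill that was resolved for each bar
fill_when_set <- unique(layer_data(p_set)$fill)
fill_when_mapped <- unique(layer_data(p_map)$fill)
```

`fill_when_set` holds the single color that was set, while `fill_when_mapped` holds one color per level of `sex`, chosen by the fill scale.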
ggplot(CPS85_clean,aes(x = wage)) +
geom_histogram()
ggplot(CPS85_clean,aes(x = wage)) +
geom_histogram(fill = "darkorange", color = "black", bins = 10)
ggplot(CPS85_clean,aes(x = wage)) +
geom_histogram(fill = "darkorange", color = "black", bins = 10, alpha = 0.7)
Specify the categorical variable that determines the color with fill, and the type of bar graph with position.
ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
geom_bar(position = "stack")
ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
geom_bar(position = "dodge")
ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
geom_bar(position = "fill")
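To make the three positions concrete, the sketch below computes the numbers behind them on a made-up data frame: "stack" and "dodge" display raw counts, while "fill" rescales each bar so its segments are proportions that sum to 1. The `toy` values are assumptions for illustration.

```r
library(ggplot2)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  sector = c("clerical", "clerical", "clerical", "sales", "sales"),
  sex    = c("Female", "Male", "Male", "Female", "Female")
)

# Counts per sector and sex: the heights shown by "stack" and "dodge"
counts <- table(toy$sector, toy$sex)

# Row-wise proportions: the heights shown by "fill"
props <- prop.table(counts, margin = 1)

# The corresponding plot, stored rather than printed
p_fill <- ggplot(toy, aes(x = sector, fill = sex)) +
  geom_bar(position = "fill")
```

With position = "fill", the y-axis is a proportion rather than a count, so it is worth relabeling the axis accordingly.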
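Specify the continuous variable on the y-axis with y =, and use geom_col(), which is shorthand for geom_bar(stat = "identity"). Because each row is drawn as its own segment and the segments stack, each bar below shows the total exper within a sector.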
ggplot(CPS85_clean, aes(x = sector, y = exper)) +
geom_col(fill = "darkorange", alpha = 0.7)
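If you want each bar to show a summary such as the mean rather than a stacked total, one option is to summarize the data first and then pass the summary to `geom_col()`. This is a sketch on made-up data using base R's `aggregate()`; the `toy` values are assumptions for illustration.

```r
library(ggplot2)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  sector = c("clerical", "clerical", "sales"),
  exper  = c(10, 20, 6)
)

# One row per sector, holding the mean experience
mean_exper <- aggregate(exper ~ sector, data = toy, FUN = mean)

# geom_col() now draws exactly one bar per row of the summary
p_mean <- ggplot(mean_exper, aes(x = sector, y = exper)) +
  geom_col(fill = "darkorange", alpha = 0.7)

bars_drawn <- layer_data(p_mean)
```

The same summary could be computed with dplyr's `group_by()` and `summarize()` if you prefer to stay in a pipe.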
Specify the continuous variable on the x-axis and the categorical variable with fill.
ggplot(CPS85_clean, aes(x = exper, fill = race)) +
geom_density(alpha = 0.4)
Specify the continuous variable with y =.
ggplot(CPS85_clean, aes(x = sector, y = exper)) +
geom_boxplot()
Reorder the boxplots by the continuous variable.
ggplot(CPS85_clean) +
geom_boxplot(aes(x = reorder(sector, exper), y = exper), color = "darkorange", alpha = .7)
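`reorder()` sorts the factor levels by a summary of the second argument, using the mean unless told otherwise. The sketch below, on made-up data, shows the default and a variant that sorts by median in descending order; the `toy` values are assumptions for illustration.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  sector = factor(c("a", "a", "b", "b", "c", "c")),
  exper  = c(30, 40, 1, 3, 10, 20)
)

# Default: levels sorted by the mean of exper, lowest first
by_mean <- reorder(toy$sector, toy$exper)
levels(by_mean)

# Sort by median instead, and negate the values for descending order
by_median_desc <- reorder(toy$sector, -toy$exper, FUN = median)
levels(by_median_desc)
```

Since a boxplot's center line is the median, sorting by median often reads more naturally than sorting by mean.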
ggplot(CPS85_clean,
aes(x = age,
y = exper)) +
geom_point(color= "darkorange")
Add a linear fit line by adding a geom_smooth layer with method = "lm".
ggplot(CPS85_clean,
aes(x = age,
y = exper)) +
geom_point(color= "darkorange") +
geom_smooth(method = "lm")
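The line drawn by `geom_smooth(method = "lm")` comes from an ordinary `lm()` fit of y on x, so you can reproduce it by hand when you need the coefficients. The sketch below uses made-up data; the `toy` values are assumptions for illustration.

```r
library(ggplot2)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  age   = c(20, 30, 40, 50, 60),
  exper = c(2, 11, 22, 31, 44)
)

# Fit the same model form that the smooth layer uses: y ~ x
fit <- lm(exper ~ age, data = toy)
coef(fit)  # intercept and slope of the fitted line

p <- ggplot(toy, aes(x = age, y = exper)) +
  geom_point(color = "darkorange") +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smooth; its y values are the fitted line on a grid of x values
smooth_line <- layer_data(p, i = 2)
manual_pred <- predict(fit, newdata = data.frame(age = smooth_line$x))
same_line <- all.equal(as.numeric(manual_pred), as.numeric(smooth_line$y))
```

The shaded band around the line is a confidence interval for the fit; pass `se = FALSE` to `geom_smooth()` to hide it.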
Specify the grouping variable by adding color to aes.
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point(color = "darkorange") + # a fixed color set in the geom overrides the color mapping from ggplot()
geom_smooth(method = "lm")
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() + # keep the color pattern for the dots
geom_smooth(method = "lm")
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() + # keep the color pattern for the dots
geom_smooth(method = "lm") +
facet_wrap(~race)
Adjust the order with limits and the labels with labels inside the scale_x_discrete layer.
# check the current levels of the factor
levels(CPS85_clean$race)
## [1] "NW" "W"
ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
geom_bar() +
scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor
labels = c("White", "Non-White")) # need to match the order of the limits
Customize the legend by specifying parameters inside scale_fill_discrete.
ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
geom_bar() +
scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor
labels = c("White", "Non-White")) + # need to match the order of the limits
scale_fill_discrete(name = "Race", labels = c("Non-White", "White"))
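Keeping two label vectors in the right order by hand is easy to get wrong. One alternative, sketched below on made-up data, is a single named vector whose names are the factor levels; discrete scales match named labels to levels by name, so the same vector can serve both the axis and the legend. The names `toy` and `race_labels` are assumptions for illustration.

```r
library(ggplot2)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(race = factor(c("NW", "W", "W")))

# Names are the factor levels, values are the labels to display
race_labels <- c(W = "White", NW = "Non-White")

p <- ggplot(toy, aes(x = race, fill = race)) +
  geom_bar() +
  scale_x_discrete(limits = c("W", "NW"), labels = race_labels) +
  scale_fill_discrete(name = "Race", labels = race_labels)

built <- ggplot_build(p)
```

With this approach, changing the order in `limits` does not require reordering the labels.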
Specify the min, max, and interval of the axis breaks with scale_x_continuous(breaks = seq()).
# check the range
range(CPS85_clean$age)
## [1] 18 64
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(breaks = seq(18, 64, 5)) # have to be within range
Add a dollar sign to the y-axis labels with scales::dollar.
# check the range
range(CPS85_clean$age)
## [1] 18 64
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(breaks = seq(18, 64, 5)) + # have to be within range
scale_y_continuous(labels = scales::dollar)
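Both helpers used in these scales can be tried on their own in the console, which is a quick way to check what the axis will show. The wage values below are made up for illustration.

```r
# The break positions passed to scale_x_continuous()
age_breaks <- seq(18, 64, 5)
age_breaks

# seq() stops at the last value that does not exceed the upper bound,
# so the final break here falls just below 64
last_break <- max(age_breaks)

# scales::dollar() turns numbers into currency strings
wage_labels <- scales::dollar(c(5, 10.5, 22))
wage_labels
```

If you want a break at the maximum itself, choose a step that divides the range evenly or supply the break values explicitly.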
Specify all labels with labs
ggplot(CPS85_clean,
aes(x = age,
y = exper)) +
geom_point(color= "darkorange") +
geom_smooth(method = "lm") +
labs(title = "A positive correlation between age and experience",
subtitle = "Arrrrrrr matey!",
x = "Age",
y = "Experience (year)",
caption = "Data taken from the `mosaicData` package.")
Store the text settings in a reusable theme object, text_settings, and add it to a plot like any other layer.
text_settings <-
theme(plot.title = element_text(size = 16, face = 'bold')) +
theme(plot.subtitle = element_text(size = 14)) +
theme(axis.title.x = element_text(size = 16, face = 'bold')) +
theme(axis.title.y = element_text(size = 16, face = 'bold')) +
theme(axis.text.x = element_text(size = 10)) +
theme(axis.text.y = element_text(size = 10)) +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
ggplot(CPS85_clean,
aes(x = age,
y = exper)) +
geom_point(color= "darkorange") +
geom_smooth(method = "lm") +
labs(title = "A positive correlation between age and experience",
subtitle = "Arrrrrrr matey!",
x = "Age",
y = "Experience (year)",
caption = "Data taken from the `mosaicData` package.") +
theme_minimal() +
text_settings
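Adding theme() calls together merges their settings, so the same text settings can also be written as a single call. The sketch below is one possible compact form; the name `text_settings_compact` and the toy data are assumptions for illustration.

```r
library(ggplot2)

# All text settings in one theme() call; size, face, and hjust for the
# title are combined into a single element_text()
text_settings_compact <- theme(
  plot.title    = element_text(size = 16, face = "bold", hjust = 0.5),
  plot.subtitle = element_text(size = 14, hjust = 0.5),
  axis.title.x  = element_text(size = 16, face = "bold"),
  axis.title.y  = element_text(size = 16, face = "bold"),
  axis.text.x   = element_text(size = 10),
  axis.text.y   = element_text(size = 10)
)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(age = c(20, 30, 40), exper = c(2, 11, 22))

# Add the stored settings after the complete theme so they are not overwritten
p <- ggplot(toy, aes(x = age, y = exper)) +
  geom_point(color = "darkorange") +
  labs(title = "Age and experience", subtitle = "Toy data") +
  theme_minimal() +
  text_settings_compact

built <- ggplot_build(p)
```

Order matters here: a complete theme such as `theme_minimal()` replaces earlier settings, so custom settings should come after it.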
Add a density plot on top of the histogram. To do so, change the y-axis of the histogram from count to density.
ggplot(CPS85_clean, aes(x = wage, y = after_stat(density))) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10) +
  geom_density(color = 'steelblue', linewidth = 1.1) +
  facet_wrap(~sex)
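The reason for switching to density is that the two layers must share a y-axis: on the density scale the histogram bars have a total area of 1, just like the area under the density curve. The sketch below checks this on made-up wage values; the `toy` data is an assumption for illustration.

```r
library(ggplot2)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(wage = c(3.5, 5, 7.25, 9, 9.5, 12, 15, 22))

p <- ggplot(toy, aes(x = wage, y = after_stat(density))) +
  geom_histogram(bins = 4)

# Each row is one bar; density times bar width gives the bar's area
bars <- layer_data(p)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
```

On the count scale the bars would sum to the number of observations instead, and the density curve would appear flattened against the x-axis.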
Assign figures to variables, then organize multiple figures using plot_grid.
library(cowplot)
wage_hist <- ggplot(CPS85_clean, aes(x = wage, y = after_stat(density))) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10) +
  geom_density(color = 'steelblue', linewidth = 1.1) +
  facet_wrap(~sex) +
labs(title = "Wage distribution by gender") +
theme_bw() + # add a theme
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
wage_age_plot <- ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(breaks = seq(18, 64, 5)) +
labs(title = "Associations between Wage and age") +
theme_light() + # add a theme
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
plot_grid(wage_hist, wage_age_plot)
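`plot_grid()` also accepts parameters for layout and panel labels. The sketch below stacks two toy plots vertically, labels them, and gives the second one more room; the plots and data are made up for illustration.

```r
library(ggplot2)
library(cowplot)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 6))

p_points <- ggplot(toy, aes(x = x, y = y)) + geom_point()
p_line <- ggplot(toy, aes(x = x, y = y)) + geom_line()

# One column, panel labels A and B, second panel twice as tall as the first
combined <- plot_grid(
  p_points, p_line,
  ncol = 1,
  labels = c("A", "B"),
  rel_heights = c(1, 2)
)
```

The combined figure can be saved with `ggsave()` or cowplot's `save_plot()` like a single plot.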
ggplot gives you the flexibility to customize almost everything. Data visualization is an art, but it is also an important means of communication. So even if I would like to spend hours finding the perfect color combination, increasing the clarity and interpretability of your data should always be your priority. Before deciding on colors, make sure the color palettes you use have sufficient contrast and are color-blind friendly.
The minihacks today are intentionally very open-ended. Get as creative as you want!
Data visualization is a great way to uncover stories in the data that would be difficult to notice by just looking at the numbers. See what stories you can uncover by exploring individual variables and their relationships with each other.
Load the SaratogaHouses dataset from the mosaicData package.
data(SaratogaHouses, package="mosaicData")
1a. Create visualizations for the heating variable and the livingArea variable. Add as many customization features as you want (e.g., color, labels, text settings, themes, etc.).
1b. Bonus: Can you find a way to highlight the most common heating pattern and mark the average living area on the figure?
2a. Create visualizations to demonstrate whether newly constructed houses have different heating patterns or not.
2b. Create a histogram of price with different colors representing different fuel types.
2c. Create visualizations to demonstrate whether the age of houses differs by fuel type.
3a. Create visualizations to demonstrate the association between the age of the houses and price.
3b. Create visualizations to demonstrate whether the association between the age of the houses and price depends on the waterfront and centralAir of the house.